Note: this is a competition from Kaggle.com, and the input data was retrieved from there.
It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable.
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
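For reference, the metric can be computed directly. Below is a minimal sketch (the helper name log_rmse and the example arrays are made up for illustration; mean_squared_error comes from scikit-learn):

import numpy as np
from sklearn.metrics import mean_squared_error

def log_rmse(y_true, y_pred):
    #RMSE between the logarithms of the observed and predicted prices
    return mean_squared_error(np.log(y_true), np.log(y_pred)) ** 0.5

#a 10% over-prediction contributes roughly the same error for a cheap and an expensive house
print(log_rmse(np.array([100000, 500000]), np.array([110000, 550000])))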
The file should contain a header and have the following format:
Id,SalePrice
1461,169000.1
1462,187724.1233
1463,175221
etc.
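A simple way to produce a file in this format is to build a DataFrame and write it with to_csv. A minimal sketch (the ids and prices below are made-up placeholders; the notebook itself assembles the file by hand at the end):

import numpy as np
import pandas as pd

#hypothetical values, only to illustrate the expected layout
test_ids = np.arange(1461, 1464)
predictions = np.array([169000.1, 187724.1233, 175221.0])
submission = pd.DataFrame({'Id': test_ids, 'SalePrice': predictions})
submission.to_csv('submission.csv', index=False)  #writes the "Id,SalePrice" header plus one row per Id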
In [1]:
import numpy as np
import pandas as pd
#load the files
train = pd.read_csv('input/train.csv')
test = pd.read_csv('input/test.csv')
data = pd.concat([train, test])
#size of training dataset
train_samples = train.shape[0]
#print some of them
data.head()
Out[1]:
In [2]:
# remove the Id feature
data.drop(['Id'], axis=1, inplace=True)
In [3]:
data.info()
In [4]:
print("Size training: {}".format(train.shape[0]))
print("Size testing: {}".format(test.shape[0]))
In [5]:
datanum = data.select_dtypes([np.number])
datanum.describe()
Out[5]:
In [6]:
data.select_dtypes(exclude=[np.number]).head()
Out[6]:
In [8]:
#number of rows with at least one NaN
print(datanum.shape[0] - datanum.dropna().shape[0])
In [9]:
#list of columns with NaN
datanum.columns[datanum.isnull().any()].tolist()
Out[9]:
In [10]:
#fill NaNs with the column mean
datanum_no_nan = datanum.fillna(datanum.mean())
#check
datanum_no_nan.columns[datanum_no_nan.isnull().any()].tolist()
Out[10]:
In [11]:
import matplotlib.pyplot as plt
datanum_no_nan.drop(['SalePrice'], axis=1).head(15).plot()
plt.show()
In [12]:
#Squeeze the data to [0,1]
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
columns = datanum_no_nan.columns
columns = columns.drop('SalePrice')
print("Features: {}".format(columns))
#work on a copy so datanum_no_nan keeps the unscaled values
data_norm = datanum_no_nan.copy()
In [13]:
data_norm[columns] = scaler.fit_transform(datanum_no_nan[columns])
print("Train shape: {}".format(data_norm.shape))
data_norm.drop(['SalePrice'], axis=1).head(15).plot()
plt.show()
In [14]:
data_norm.describe().T
Out[14]:
In [15]:
#plotting distributions of numeric features
data_norm.hist(bins=50, figsize=(22,16))
plt.show()
In [16]:
data_norm['1stFlrSF'].hist()
plt.show()
In [17]:
#transform the data so it's closer to normal (Box-Cox)
from scipy import stats
data_gauss = data_norm.copy()
for f in datanum.columns.tolist():
    #Box-Cox requires strictly positive values, hence the small offset
    data_gauss[f], _ = stats.boxcox(data_gauss[f] + 0.01)
#rescale again
std_scaler = preprocessing.StandardScaler()
data_gauss[columns] = std_scaler.fit_transform(data_gauss[columns])
data_gauss['1stFlrSF'].hist()
plt.show()
In [18]:
#plotting distributions of numeric features
data_gauss.hist(bins=50, figsize=(22,16))
plt.show()
In [19]:
#one-hot encode the non-numeric columns and join them to the numeric data
data.select_dtypes(exclude=[np.number]).head()
data_categorical = pd.get_dummies(data.select_dtypes(exclude=[np.number]))
data_all = pd.concat([data_norm, data_categorical], axis=1)
In [20]:
#data_norm.columns.tolist()
feat_list = ['1stFlrSF',
#'2ndFlrSF',
#'3SsnPorch',
'BedroomAbvGr',
'BsmtFinSF1',
#'BsmtFinSF2',
#'BsmtFullBath',
#'BsmtHalfBath',
'BsmtUnfSF',
#'EnclosedPorch',
#'Fireplaces',
#'FullBath',
'GarageArea',
'GarageCars',
'GarageYrBlt',
#'GrLivArea',
#'HalfBath',
#'KitchenAbvGr',
'LotArea',
'LotFrontage',
#'LowQualFinSF',
'MSSubClass',
'MasVnrArea',
#'MiscVal',
'MoSold',
'OpenPorchSF',
'OverallCond',
'OverallQual',
'PoolArea',
#'SalePrice',
#'ScreenPorch',
'TotRmsAbvGrd',
'TotalBsmtSF',
'WoodDeckSF',
'YearBuilt',
'YearRemodAdd']
#'YrSold']
In [21]:
%matplotlib inline
import seaborn as sns
fig = plt.figure(figsize=(14, 10))
sns.heatmap(data_norm[feat_list+['SalePrice']].corr())
Out[21]:
In [22]:
#heatmap
fig = plt.figure(figsize=(14, 10))
sns.heatmap(data_norm.corr())
Out[22]:
In [23]:
# Features most correlated with SalePrice
data_norm.corr()['SalePrice'].sort_values().tail(13)
Out[23]:
In [24]:
feat_low_corr = ['KitchenAbvGr',
'EnclosedPorch',
'MSSubClass',
'OverallCond',
'YrSold',
'LowQualFinSF',
'MiscVal',
'BsmtHalfBath',
'BsmtFinSF2',
'MoSold',
'3SsnPorch',
'PoolArea',
'ScreenPorch']
feat_high_corr = ['Fireplaces',
'MasVnrArea',
'YearRemodAdd',
'YearBuilt',
'TotRmsAbvGrd',
'FullBath',
'1stFlrSF',
'TotalBsmtSF',
'GarageArea',
'GarageCars',
'GrLivArea',
'OverallQual']
data_norm_low_corr = data_norm[feat_low_corr]
data_norm_high_corr = data_norm[feat_high_corr]
In [152]:
from sklearn.model_selection import KFold
y = np.array(data_all['SalePrice'])
X = np.array(data_norm_high_corr)
#split back into train and test rows by position (test rows have no true SalePrice; it was filled with the mean)
idx = train_samples
X_train, X_test = X[:idx], X[idx:]
y_train, y_test = y[:idx], y[idx:]
print("Shape X train: {}".format(X_train.shape))
print("Shape y train: {}".format(y_train.shape))
print("Shape X test: {}".format(X_test.shape))
print("Shape y test: {}".format(y_test.shape))
kf = KFold(n_splits=3, random_state=9, shuffle=True)
print(kf)
In [153]:
#plotting PCA
from sklearn.decomposition import PCA
def plotPCA(X, y):
    #project onto the first principal component and plot it against the target
    pca = PCA(n_components=1)
    X_r = pca.fit(X).transform(X)
    plt.plot(X_r, y, 'x')
In [154]:
from sklearn.covariance import EllipticEnvelope
# fit the model
ee = EllipticEnvelope(contamination=0.05,
                      assume_centered=True,
                      random_state=9)
ee.fit(X_train)
pred = ee.predict(X_train)
#keep only the inliers (predict returns 1 for inliers, -1 for outliers)
X_train = X_train[pred == 1]
y_train = y_train[pred == 1]
print(X_train.shape)
print(y_train.shape)
#after removing anomalies
plotPCA(X_train, y_train)
In [155]:
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
rf = MLPRegressor(activation='relu',
                  solver='lbfgs',
                  #learning_rate_init=1e-2,
                  #learning_rate='adaptive',
                  #alpha=0.0001,
                  max_iter=400,
                  #shuffle=True,
                  hidden_layer_sizes=(64, 64),
                  warm_start=True,
                  random_state=9,
                  verbose=False)
for e in range(1):
    batch = 1
    for train_idx, val_idx in kf.split(X_train, y_train):
        X_t, X_v = X_train[train_idx], X_train[val_idx]
        y_t, y_v = y_train[train_idx], y_train[val_idx]
        #training
        rf.fit(X_t, y_t)
        #calculate costs (RMSE)
        t_error = mean_squared_error(y_t, rf.predict(X_t))**0.5
        v_error = mean_squared_error(y_v, rf.predict(X_v))**0.5
        print("{}-{}) Training error: {:.2f} Validation error: {:.2f}".format(e, batch, t_error, v_error))
        batch += 1
#Scores
print("Training score: {:.4f}".format(rf.score(X_train, y_train)))
In [181]:
# Gradient boosting
from sklearn import ensemble
params = {'n_estimators': 100, 'max_depth': 50, 'min_samples_split': 5,
          'learning_rate': 0.1, 'loss': 'ls', 'random_state': 9, 'warm_start': True}
gbr = ensemble.GradientBoostingRegressor(**params)
batch = 0
for train_idx, val_idx in kf.split(X_train, y_train):
    X_t, X_v = X_train[train_idx], X_train[val_idx]
    y_t, y_v = y_train[train_idx], y_train[val_idx]
    #training
    gbr.fit(X_t, y_t)
    #calculate costs (RMSE)
    t_error = mean_squared_error(y_t, gbr.predict(X_t))**0.5
    v_error = mean_squared_error(y_v, gbr.predict(X_v))**0.5
    print("{}) Training error: {:.2f} Validation error: {:.2f}".format(batch, t_error, v_error))
    batch += 1
#Scores
print("Training score: {:.4f}".format(gbr.score(X_train, y_train)))
In [157]:
# AdaBoost
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
abr = AdaBoostRegressor(DecisionTreeRegressor(max_depth=50),
                        n_estimators=100, random_state=9)
batch = 0
for train_idx, val_idx in kf.split(X_train, y_train):
    X_t, X_v = X_train[train_idx], X_train[val_idx]
    y_t, y_v = y_train[train_idx], y_train[val_idx]
    #training
    abr.fit(X_t, y_t)
    #calculate costs (RMSE)
    t_error = mean_squared_error(y_t, abr.predict(X_t))**0.5
    v_error = mean_squared_error(y_v, abr.predict(X_v))**0.5
    print("{}) Training error: {:.2f} Validation error: {:.2f}".format(batch, t_error, v_error))
    batch += 1
#Scores
print("Training score: {:.4f}".format(abr.score(X_train, y_train)))
In [158]:
# Lasso
from sklearn.linear_model import Lasso
lr = Lasso()
batch = 0
for train_idx, val_idx in kf.split(X_train, y_train):
    X_t, X_v = X_train[train_idx], X_train[val_idx]
    y_t, y_v = y_train[train_idx], y_train[val_idx]
    #training
    lr.fit(X_t, y_t)
    #calculate costs (RMSE)
    t_error = mean_squared_error(y_t, lr.predict(X_t))**0.5
    v_error = mean_squared_error(y_v, lr.predict(X_v))**0.5
    print("{}) Training error: {:.2f} Validation error: {:.2f}".format(batch, t_error, v_error))
    batch += 1
#Scores
print("Training score: {:.4f}".format(lr.score(X_train, y_train)))
In [178]:
### Testing
### Ada + mlp + gradient boosting -> level 1 predictions
### level 1 -> mlp -> level 2 predictions (final)
# Training
#mlp1 = MLPRegressor(activation='logistic',
# solver='sgd',
# hidden_layer_sizes=(5,5),
# learning_rate='adaptive',
# random_state=9,
# warm_start=True,
# verbose=False)
from sklearn.linear_model import LogisticRegression
#NOTE: LogisticRegression is a classifier, so the level-2 model treats every distinct SalePrice as a class
mlp = LogisticRegression(random_state=9)
sclr = preprocessing.StandardScaler()
def stack_training(X, y):
    #level 1: predictions of the base models become the features of the level-2 model
    X0 = rf.predict(X)
    X1 = gbr.predict(X)
    X2 = abr.predict(X)
    X3 = lr.predict(X)
    Xt = np.array([X0, X1, X2, X3]).T
    #Xt = np.array([X0, X1, X2, X3, X1+X3, X2*X3, X0*X2*X3, X0/X2, X1/X3, X0/X3, (X0+X1+X2+X3)/4]).T
    Xt = sclr.fit_transform(Xt)
    mlp.fit(Xt, y)
def stack_predict(X, verbose=False):
    X0 = rf.predict(X)
    X1 = gbr.predict(X)
    X2 = abr.predict(X)
    X3 = lr.predict(X)
    Xt = np.array([X0, X1, X2, X3]).T
    #Xt = np.array([X0, X1, X2, X3, X1+X3, X2*X3, X0*X2*X3, X0/X2, X1/X3, X0/X3, (X0+X1+X2+X3)/4]).T
    Xt = sclr.transform(Xt)
    if verbose:
        #verbose mode assumes X is X_train (it scores against the global y_train)
        print("Training score: {:.4f}".format(mlp.score(Xt, y_train)))
        plotPCA(Xt, y_train)
    return mlp.predict(Xt)
#
batch = 0
kf = KFold(n_splits=10, random_state=9, shuffle=True)
for train_idx, val_idx in kf.split(X_train, y_train):
    X_t, X_v = X_train[train_idx], X_train[val_idx]
    y_t, y_v = y_train[train_idx], y_train[val_idx]
    #training
    stack_training(X_t, y_t)
    #calculate costs (RMSE of the stacked predictions)
    t_error = mean_squared_error(y_t, stack_predict(X_t))**0.5
    v_error = mean_squared_error(y_v, stack_predict(X_v))**0.5
    print("{}) Training error: {:.2f} Validation error: {:.2f}".format(batch, t_error, v_error))
    batch += 1
rmse = mean_squared_error(y_train, stack_predict(X_train, True))**0.5
print("RMSE: {:.4f}".format(rmse))
In [177]:
from sklearn.metrics import mean_squared_error
import random
RMSE_rf = mean_squared_error(y_train, rf.predict(X_train))**0.5
RMSE_gbr = mean_squared_error(y_train, gbr.predict(X_train))**0.5
RMSE_abr = mean_squared_error(y_train, abr.predict(X_train))**0.5
RMSE_lr = mean_squared_error(y_train, lr.predict(X_train))**0.5
RMSE_stack = mean_squared_error(y_train, stack_predict(X_train))**0.5
def avg_predict(X):
    #simple average of the four base models
    return (rf.predict(X) + gbr.predict(X) + abr.predict(X) + lr.predict(X))/4
predictions = avg_predict(X_train)
RMSE_total = mean_squared_error(y_train, predictions)**0.5
print("RMSE mlp: {:.3f}".format(RMSE_rf))
print("RMSE gbr: {:.3f}".format(RMSE_gbr))
print("RMSE abr: {:.3f}".format(RMSE_abr))
print("RMSE lr: {:.3f}".format(RMSE_lr))
print("====")
print("RMSE average: {:.3f}".format(RMSE_total))
print("RMSE stacked: {:.3f}".format(RMSE_stack))
In [33]:
import os
#predict = avg_predict(X_test)
predict = stack_predict(X_test)
file = "Id,SalePrice" + os.linesep
startId = 1461
for i in range(len(X_test)):
    file += "{},{}".format(startId, int(predict[i])) + os.linesep
    startId += 1
#print(file)
In [34]:
# Save to file
with open('attempt.txt', 'w') as f:
    f.write(file)